A Statistical Method for Handling Unknown Words

Author

  • Alexander Franz

Abstract

Robust natural language processing systems must be able to handle words that are not in their lexicon. We created a classifier, trained on tagged text, that finds the most likely parts of speech for unknown words. The classifier uses a contingency table to count the observed features and a loglinear model to smooth the cell counts. After smoothing, the contingency table is used to obtain the conditional probability distribution for classification. The features were determined by exploration (Tukey 1977); for example, whether the word is capitalized, or whether it carries one of a number of known suffixes. We maximize the conditional probability of the proposed classification given the features to achieve minimum error rate classification (Duda & Hart 1973). Baseline results are obtained by using only the prior probabilities P(c) (column Prior). Weischedel et al. (1993) describe a probabilistic model with four features that are treated as independent, which we reimplemented (column 4 Indep). For comparison, we created a statistical classifier with the same four features (column 4 Class). Our best model was a classifier with nine features (column 9 Class).
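The pipeline described in the abstract (count features in a contingency table, smooth the cells, then pick the class with the highest conditional probability) might be sketched roughly as follows. This is only an illustration under stated assumptions: the two features, the suffix list, the tag names, and the toy training data are all hypothetical, and simple additive smoothing is used here as a stand-in for the loglinear smoothing that the paper actually applies.

```python
from collections import Counter

# Hypothetical suffix list; the abstract does not enumerate the real one.
SUFFIXES = ("ing", "ed", "ly", "s")


def features(word):
    """Map a word to a feature tuple: (is capitalized, first matching suffix or '')."""
    suffix = next((s for s in SUFFIXES if word.lower().endswith(s)), "")
    return (word[0].isupper(), suffix)


class UnknownWordClassifier:
    """Contingency-table classifier with additive smoothing.

    Assumption: additive smoothing replaces the loglinear model named in
    the abstract, purely to keep the sketch short.
    """

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # pseudo-count added to every cell
        self.cells = Counter()  # contingency table: (feature tuple, tag) -> count
        self.tags = set()

    def train(self, tagged_words):
        for word, tag in tagged_words:
            self.cells[(features(word), tag)] += 1
            self.tags.add(tag)

    def posterior(self, word):
        """Return the smoothed estimate of P(tag | features of word)."""
        f = features(word)
        smoothed = {t: self.cells[(f, t)] + self.alpha for t in self.tags}
        total = sum(smoothed.values())
        return {t: v / total for t, v in smoothed.items()}

    def classify(self, word):
        """Choose the tag that maximizes the conditional probability."""
        dist = self.posterior(word)
        return max(dist, key=dist.get)


# Toy tagged data (hypothetical), standing in for a tagged corpus.
TRAINING = [
    ("running", "VBG"), ("walking", "VBG"), ("jumped", "VBD"),
    ("quickly", "RB"), ("London", "NNP"), ("Berlin", "NNP"),
    ("tables", "NNS"),
]

clf = UnknownWordClassifier()
clf.train(TRAINING)
print(clf.classify("blorping"))  # an unseen word ending in "ing"
```

Choosing the tag with the largest posterior is what yields minimum error rate classification in the sense cited above; only the smoothing step differs from the method the abstract describes.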

Similar resources

Hybrid POS tagging with generalized unknown-word handling

This paper presents POSTAG as a statistical/rule-based hybrid part-of-speech (POS) tagging system with generalized unknown-word handling. The POSTAG integrates morphological analysis with statistical POS disambiguation and post rule-based error-correction. The error-correction rules are automatically learned from a tagged corpus and selectively correct standard HMM tagging errors. The morpho...

Handling Unknown Words in Statistical Machine Translation from a New Perspective

Unknown words are one of the key factors that drastically impact translation quality. Traditionally, nearly all related research focuses on obtaining the translation of the unknown words in different ways. In this paper, we propose a new perspective for handling unknown words in statistical machine translation. Instead of making great effort to find the translation of unknown words, th...

Translation of unknown words in phrase-based statistical machine translation for languages of rich morphology

This paper proposes a method for handling out-of-vocabulary (OOV) words that cannot be translated using conventional phrase-based statistical machine translation (SMT) systems. For a given OOV word, lexical approximation techniques are utilized to identify spelling and inflectional word variants that occur in the training data. All OOV words in the source sentence are replaced with appropriate ...

A Novel Approach for Handling Unknown Word Problem in Chinese-Vietnamese Machine Translation

For languages where space cannot be a boundary of a word, such as Chinese and Vietnamese, word segmentation is always the task to be done first in a statistical machine translation system (SMT). The word segmentation increases the translation quality, but it causes many unknown words (UKW) in the target translation. In this paper, we will present a novel approach to translate UKW. Based on the ...

Generalized unknown morpheme guessing for hybrid POS tagging of Korean

Most errors in Korean morphological analysis and POS (Part-of-Speech) tagging are caused by unknown morphemes. This paper presents a generalized unknown morpheme handling method with POSTAG (POStech TAGger), which is a statistical/rule-based hybrid POS tagging system. The generalized unknown morpheme guessing is based on a combination of a morpheme pattern dictionary which encodes general le...

Publication date: 1994